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Abstract 



Looking for associations among multiple variables is a topical issue in statistics due to 
the increasing amount of data encountered in biology, medicine and many other domains 
involving statistical applications. Graphical models have recently gained popularity for 
' this purpose in the statistical literature. Following the ideas of the LASSO procedure 

C/3 , designed for the linear regression framework, recent developments dealing with graphical 

model selection have been based on £i-penalization. In the binary case, however, exact 
inference is generally very slow or even intractable because of the form of the so-called 
^ ' log-partition function. Various approximate methods have recently been proposed in the 

, literature and the main objective of this paper is to compare them. Through an extensive 

' simulation study, we show that a simple modification of a method relying on a Gaussian 

, approximation achieves good performance and is very fast. We present a real application 

• ' in which we search for associations among causes of death recorded on French death 

[ certificates. 

o 

1 Introduction 



^ ■ In biology, medicine, and many other domains of statistical application, researchers are in- 

creasingly faced with problems involving numerous variables and a natural problem is that 
of studying their relationships. Standard examples are the construction of social or commu- 
nication networks and systems biology. When the underlying variables are binary (which is 
the focus of this paper), a classical way for stu dying their rela t ionships is to use Poisson lon - 
linear models for multiway contingency tables (Agresti, 19901 ) ( Mccullagh and Nelderl . 19891 ). 



How to perform selection in log-linear models, or equivalently in binary graphical models, 
depends upon the number of variables p. Indeed, the total number of cell entries for a p-way 
contingency table is 2^, and the total number of free parameters in the associated saturated 
model is 2^ — 1. When p is low, a standard approach for model selection is greedy stepwise 
forward-selection or backward-deletion: in each step, selection or deletion is based on hy- 
pothesis testing at some level a. However, the computational complexity even for modestly 
dimensioned contingency tables plus the multiple hypothesis testin g issues related to such 



a procedure has made it unpopular in this context. Consequently, iDahinden et al.l (j2007l ) 
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recently performed selection in log-linea r models by using an ^i-penalized version of the log- 
likelihood, extending the LASSO ideas ( Tibshiranil . 119961 ) originally designed for selection in 
linear models. However, computing the (penalized) log-likelihood for log-linear models gen- 
erally requires the enumeration of each of the 2^ profiles, which is not plausible for large p 
(e.g., larger than about 30). For such moderate-to-large values oi p, alternative methods are 
required. 

Roughly speaking, two approaches have been proposed in the literature. First, exact inference 
can be performed in the case of highly sparse models. For instance, exact computation via the 
junction tree algor i thm is manageable for highly sparse graphs but becomes unwieldy for dense 



graphs ( Lee et al. . 20071 ). The second approach is to use approximate inference. Notably, 



much attention has rec e ntly b een p aid to metho d s rely ing on proxies for the exact likelihood. 
Hofiing and Tibshiranil (|2009l ) and IWang et al.l toid ) proposed two distinc t algor ithms to 
maximize an £i-penalized version of the so-called vseudo-lik elihood (IBesad. Il975l). These 
methods are closely related to the one formerly proposed by IWainwright et al. who 
used £i-penalized logistic regression s on each single nod e to construct the whole graphical 
model. Besides these three methods, Banerjee et al. ( 20081 ) used a convex relaxation technique 
to derive a Gaussian approximate log-likelihood as well as its sparse maximum solution. 

Interestingly, Hofiing and Tibshirani ( 20091 ) showed through an extensive simulation study 
that approximate solutions (ei ther solutions max i miziii g the pseudo-likelihood or those derived 
from the method proposed by Wainwright et al. ( 20061 )) are much faster and only slightly less 
accurate than exact r nethods. Howev e r, no empirical evaluation of the Gaussian approximate 
solution proposed by Banerjee et al. ( 20081 ) has ever been conducted and filling this gap is 



the primary objective of this paper. We thereby propose to condu ct such an evaluat i on b; 



com paring this method with the t wo other approximate methods of IWainwright et al.l (120061 ) 
and Hofiing and Tibshirani ( 20091 ) on simulated data. 



Here is a brief outline of the paper. In Section [2] we first summarize the principles of the 
aforementioned approximate methods. We then present results froi n an extensive e mpiri cal 
comparison study. A slight modification of the method proposed by Baneriee et al. ( 20081 ) is 
especially shown to achieve very good accuracy and to be extremely fast. Finally, we present 
an application from a real example where we looked for associations among causes of death 
in the database of French death certificates of the year 2005 (Section H]). 



2 Approximate methods for binary graphical models 
2.1 The Ising model 

Let X = (X«,...,X(P)) G {0,1}P be a p-dimensional vector of binary random variables. 
Given a random sample Xi, ...,X„ of i.i.d. replicae of X, we wish to study the associations 
between the coordinates of X. One way to do so is to construct the binary graphical model 
for the random vector X, that is an undirected graph Q = (V, E), where V contains p vertices 
corresponding to the p coordinates and the edges E = {ek/)i<k<e<p describe the conditional 
independence relationships among The edge eke between X^^^ and is ab- 

sent if and only if X*^^-* and X^^'^ are independent conditional on the other variables. For binary 
graphical models, it is common to focus only on the family of p r obabi l ity distribution s given 
by the quadratic exponential binary model (jWainwright et al.l . l2006l : iBaneriee et al.l . l2008l : 
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Hofling and Tibshiranil . l2009l : IWang et alJ . l20ld ) , also known as the Ising model. Namely, for 
allx= G {0,1}^, we assume that the probability of x is given by 



P(x,e) = exp{^0fcx('=) + ^ J2 Ok., 



k=l k=li=k+l 

where the so-called log partition function A{-) is defined as follows 

p p— 1 p 

k=l k=li=k+l 



A{Q)}, 



A(G) = log exp{£0,x('=) + ^ £ 

xG{0,l}P ■ - ^ - 



Y 



Note that A{-) is strictly convex and ensures that X]xe{o,i}p ^(x, G) = 1 (note also that the 
strict convexity of this function ensures the identifiability of the parameter matrix 0). From 
([1]), we have 



exp(fyA,.,. 



(3) 



Therefore, the parameters O^^i are the conditional log-odds ratios. The conditional indepen- 
dence between X*^^) and X^^^ is then equivalent to 9k,i = 0, that is, the edge eu/ is absent 
if and only if 9^^^ = 0. Consequently, selection in binary graphical models is equivalent to 
identifying the {k,l) pairs for which ^ = 0. 

From ([1]), using the fact that x^^^ is binary, and denoting hy X = (Xi, ...,X„)"^ the matrix 
representing the whole dataset, the ^i-penalized log-likelihood writes 



/(^,e)= {X'^X)k,iek/-nA{Q)-nX\\Q\\i, 



(4) 



t>k>l 



where is a symmetric matrix with O^k = Gk for k = However, because of the 

complexity of the log-partition function, methods based on approximate inference are needed 
in most cases and we recall the principle of three of them in the following paragraphs. 

Let us begin by recalling that the approximation established by Banerjee et al. ( 20081 ) is 
only valid for the f irst-order- interaction log-linear mode l des cribed above . On t he other hand, 
Wainwright et al. (|2006l ). iHofiing and Tibshiranil (|2009l ) and lWang etHI (|2010l ) also only con- 
sider this simple model but higher-order interaction models can be (at least theoretically) 
handled with these methods, at a cost of a dramatically increased computational time. 



2.1.1 Multiple logistic regressions 



(x 



(1) 



„(fe-i) ^(fc+i) 



From ([T]), it is easy to see that, for all k = 1, setting x*^ ^'^ 
x'^P^), we have 

logit{P(x('=) = l|x(-'=))} = Y^k/x^^^ + Ou. 

ti^k 

Wainwright et al. (j2006l ) then extensively study a method (which w ill be referred hereafter 



(5) 



as SepLogit following the terminology adopted in IWang et al.l in which ^i-penalized 

logistic regression is used to estimate the neighborhood of each of the p nodes in the graph 
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separately. Wainwright et al. ( 20061 ) gives rigorous consistency results in a high-dimensional 
setting, where the number of nodes is allowed to grow as a function of the number of samples. 
The authors give sufficient conditions under which the method will consistently estimate the 
neighborhood for every node in the graph simultaneously. In a sense, the paper can be seen 
as a discrete version of Meinshausen and Biihlmann ( 20061 ). 



For a finite number of samples, the p logistic regression problems are solved separately and, 
since the results may be asymmetric, they can be combined in one of two ways to draw a 
graph. One possibility is to draw an edge between two nodes in the graph only if each node is 
estimated to belong to the neighborhood of the other (method SepLogit AND). Alternatively, 
we can decide to draw an edge between two nodes so long as at least one of them is estimated 
to belong to the neighborhood of the other (method SepLogit OR). 

In our empirical comparison st udy, this method will b e implemented using the coordinate 
descent procedure developed by Friedman et al. ( 2008bl ) (and implemented in the glmnet R 
package) . 



2.1.2 Pseudo-likelihood maximization 



One of the shortcomings of the method propose d bvlWainwright et al.l (120061) is th e aforemen- 
tioned asymmetry. To overcome this limitation, Hofling and Tibshirani (2009) and^Wang et ^1 

recently proposed to use the pseudo-(log-)likelihood, first suggested by iBesag (jl97 
The pseudo-likelihood is formally defined as 



5:^iog{p(xr^ix 



(fc)l 



^(fc+1) 



(6) 



=1 k=l 



Accordingly, the approach based on the maximization of the ^i-penalized pseudo-likelihood 
solves all p logistic regression problems simultaneously, while enforci ng symmetry. Apa r t fron i 
symmetry enforcement, this methods differs from the one studied in IWainwright et al.l (j2006l ^ 
in that the £i-norm penalty is applied to the entire network, while in SepLogit it is applied 
to each neighborhood. 

For future use, and still denoting by X = (Xi, X„)"^ the matrix representing the whole 
dataset, observe that the pseudo-likelihood writes 

n p 

pseudo-/(Af, e) = - ^ ^ log {1 + exp(-xf ) Af'^fi, ]e[, k])}, 



(7) 



i=l k=l 



where X is the same as X with A;th column set to 1 and x) 

Ik) 



(k) 



2x) 



(k) 



1 i.e., 



is the spin 



version of x\ Here and elsewhere, M[i,] (resp. M[,k]) denotes the i-th row (resp. k-th 
column) of a matrix M. 

Hofling and Tibshiranil (|2009l ) first develop and implement an algorithm for maximizing the 
•^i-penalized pseudo-likelihood function, using a local quadratic approximation to the pseudo- 
likelihood. They then use this algorithm as a building block for a new algorithm that maxi- 
mizes the true log-likelihood. However, as we already said, they observed that the approximate 
pseudo-likelihood is much faster than the exact procedure, and only slightly less accurate. 
Therefore, to save computational time, we only considered the approximate pseudo-likelihood 
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in this paper. In the forthcoming empirical comparison study, this method wih be imple- 
mented using the BMN R package and will be referred to as BMNPseudo. 

Interestingly, and as pointed out by Hofiing and Tibshirani ( 20091 ). the derivative of the 
pseudo-likelihood on the off-diagonal is roughly twice as large as the derivative of the exact 
likelihood. Moreover, in the case p = 2, it is easy to see that the deviance of the model with 
no association (i.e. minus twice the difference between the log-likelihood of this model and the 
log-likelihood of the saturated model) when computing with the pseudo-likelihood is twice as 
large as the one computed with the exact likelihood (while, obviously, the pseudo-likelihood 
coincides with the exact likelihood for the model with no interaction). The generalization of 
this striking result for higher p is not straightforward, but our empirical examples suggest it 
may hold at least approximately (see Section [3T3|) . Therefore, we will consider methods relying 
on both pseudo-/(A', 6) (method BMNPseudo) and pseudo-/(Af, 6)/2 (method BMNPseudo 1/2) 
in the sequel. 

For the sake of completeness, we shall add that IWang et al.l \20ld ) develop a gradient-descent 
algorithm to maximize the ^i-penalized pseudo-likelihood. They further propose an extension 
to account for spatial correlation among the variables (which was relevant in their exam- 
ple dealing with genomic data). However, the corresponding LogitNet R package was not 
available at the time we wrote this paper, so we were not able to include it in our empirical 
comparison study. 



2.1.3 Gaussian Approximation of the Ising log-likelihood 



The basic idea of the method described bv lBaneriee et al.l (120081) is to replace the log-p a rtitio n 
function in the Ising model with an upper bound suggested by Wainwright and Jordan ( 20061 ). 
The resulting approximation can then, with some manipulation, be put in a form that can be 
solved efficiently using block coordinate descent. In order to add some specific details, we shall 
define some notation. Denote by (Zi,...,Z„) S {—1,1}^^" the spin version of (Xi,...,X„), 

and let Z denote the sample mean of variable Z' ', for k = 1, ...,p. Now, define the empirical 
covariance matrix S as 



5 = -V(Z,-Z)(Zi-Z) 



n 



(8) 



where Z is the vector of sample means Z' Making use of a convex relaxation and a use- 
ful upper bound on the log-partition function obtained by IWainwright and Jordan (|2006l ). 
Baneriee et al. ( 20081 ) established that an approximate sparse maximum likelihood solution 
for a given A has the following form 



hi 



Z 



{k) 



-{'^x^)ke, 



(9) 



where the matrix S > ^ is the solution of the following optimization problem 



argmax{log |M| - tr(M(5 + diag(l/3))) 



|M||i}. 



(10) 
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More precisely, Baneriee et al. ( 20081 ) proposed a block-coordinate descent algorithm to solve 



a dual formulation of (1101). which can be written as 



±x = argmax { log \W\ : Wkk = Skk + \. \Wm - Sm\ < A}. (11) 
W 6 



In the Gaussian case, iBaneriee et al. I (I2OO8I ) showed that the £i-penalized covariance selection 



problem could be written 



argmax{log|Vr| : ||VF-5g||oo < A}, (12) 
w 



where So is the empirical covariance matrix attached to a given sample of Gaussian vectors. 
An algorithm for handling binary graphical models can be derived by comparing (jlip and 
(|12p . The original {0,1} data has first to be transformed into {—1,1} data. Then, adding 
the constant 1/3 to the diagonal elements of the resulting empirical covariance matrix, the 
algorithms deve l oped in the Gaussian case (in particular the glasso R package developed by 
Friedman et"all (|2008al l^ can be reused. 



A common question when working with Gaussian variables is whether to standardize them, or 
equivalently, whether to use the covariance or the correlation matrix. Moreover, in the binary 
case, the correlation coefficient between two variables (also known as the (/(-coefficient) is 
closely related to the statistic used to test for (marginal) independence. Putting these two 
observations together , we decided to evaluate a simple modification of the method proposed by 
Banerjee et al. ( 20081 ) where the quantity 5+diag(l/3) is replaced by the correlation matrix. 



Lastly, we also decided to evaluate the modification in which 5+diag(l/3) is simply replaced 
by 5. 

To recap, we will consider the three following optimization problems 

= argmax{log|M| -tr(M5^) - A||M||i}, for u = 1,2,3, (13) 

where = (Cov(Z) + diag(l/3)), = Cov(Z) and = Cor(Z). For any A, and every 

= 1, 2, 3 an estimation of Ok/ is then given by —{C^)ke- 
In our empi rical comparison study, the three methods will be implemented using the glasso 
R package ( Friedman et al.l . l2008al ) and will be referred to as GaussCov 1/3, GaussCov and 



GaussCor for the choices = 1, = 2 and = 3 respectively (we may as well use the generic 
expression Gauss Approx when dealing with either methods). 

2.2 Sparsity parameter selection 

Two procedures for selecting tuning parameters are generally considered, namely cross-validation 
(CV) and Bayesian In formation Criterion ( BIG), the latter being computationally more ef- 



ficient as suggested by Yuan and Lin (2003) for instance. In the case of Gaussian graphical 



models, iGao et all t00% further demonstrate the advantageous performance of BIG for spar- 



sity parameter selection through simulation studies. In this paper, we therefore decided to 
only consider BIG. 

When trying to select the optimal sparsity parameter A using either CV or BIG, however, one 
has to pay attention to the following fact. Since, for each A > 0, estimates of the parameters 
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of interest a re shrunk, using t hem for choosing A from CV or BIG often results in severe 
over-fitting ( Efron et al. . 20041 ) . Therefore, un-shrunk estimates have to be derived before 



computing the BIG. 

Taking the example of methods GaussApprox, for any A, we have to compute the un-shrunk 



matrix 



= argmax { log \M\ - tiiMS")}. (14) 



Here, = {M y : Mk/ = for couples {k,£) such that {C^)k,e = 0} ({M y 0} being the 
set of positive definite matrices of order p, and C'^ being as in ([9])). To solve the optimization 
problem ()14p . one approach is to reuse the algorithm used to solve (I13p after replacing the 
scalar parameter A by the penalty matrix A such that A^^^ = if ^ 0, and A^^^ = oo 

otherwise, where 9^^ is the shrunk estimation of the coefficient 9k/ obtained with the value 
A and is used as an initial value for the opti mization. Altern ative approaches might be 
considered, such as the algorithm developed by Dahl et al. ( 20081 ) for instance. 



Defining 

£^ = log|C,n-tr(COT, 
the BIG procedure now simply corresponds to selecting the sparsity parameter Agj^ such that 

A^IC = arg max {nCi - K'^ log(n)}, (15) 

where = J2k>i ^ ^C^)k/ ^ 0| i s the degree of freedom of the model selected with the 
sparsity parameter A (j Yuan and Lin . I2OO7I ). 



2.3 Estimation of the conditional odds-ratios 

In the binary case, a standard measure of the strength of association between two variables is 
the (conditional) odds-ratio, which is related to coefficient 9k/ (see ([3])). Therefore, consistent 
estimates of the parameters 9^/ would yield consistent estimates of the conditional odds- 
ratios. Here again un-shrunk estimates are preferable, and the methods described in the 
previous section have to be used. 



3 Simulation study 

In this section, we compare the model selection performances as well as the computational 
time for the methods described in the previous Section. Results are presented for p = 10 and 
p = 50. The choice p = 10 has been made for several reasons. First, for such low values 
of p the true log-likelihood can be quickly computed, and we can then compare it with the 
approximate log-likelihoods (see Section [3^ below). Second, approximate methods are still 
faster to compute when p is small, and conclusions drawn in the case of low p are likely to 
hold for high p as well (as will be confirmed from our results) . 



3.1 Evaluation criteria 

For each method, every value of A corresponds to a sparsity structure for the matrix that 
can be compared with the true sparsity structure. Namely, for all A and for each method. 
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we can compute the rate of true positives (correctly identified associations), the rate of false 
positives (incorrectly identified associations) as well as the overall accuracy. Precision and 
recall (the latter being identical to the true positive rate) can also be computed, as well as 
their harmonic mean, often referred to as the Fl-score. 

In a first evaluation study, we present for each method the performances achieved by the 
"oracle" model, that is the model constructed with optimal sparsity parameter regarding 
accuracy. Such an evaluation was not conducted for SepLogit because under this method 
the ii penalty is applied to each neighborhood and the "oracle" model would invariably 
coincide with the true model. The alternative would be to force the algorithm to choose the 
same sparsity parameter value for every regression model. However, using this alternative 
approach, we sometimes obtained "oracle" models that achieved performances worse than 
the models selected by the BIC procedure. Therefore, we do not recommend to force the 
algorithm to choose the same sparsity parameter value for every regression model. 

In a second evaluation study, we present for each method the performances achieved by the 
model selected according the BIC procedure described above. For methods GaussApprox, 
un-shrunk estimates were derived along the lines described in Section [2]2j A similar approach 
was used for method BMNPseudo. The BMN R package also allows the use of a matrix of penalty 
coefficients. For SepLogit, we had to slightly adapt this approach because the glmnet package 
does not allow for building models with only un-penalized coefficients. So, whenever needed, 
a standard logistic regression model was used to get un-shrunk estimates. This may make 
the method a little slower, but not much since for each variable, this situation can only arise 
for the smallest tested A value, and only if this smallest tested A value corresponds to the 
saturated model. 

In addition, the computational time is reported. More precisely, we used a grid [A™™ := 
A™^^/1000, A™^^] of 50 equally-spaced values (on a log-scale) for the parameter A and we 
report the time needed to compute the 50 corresponding models for each method (A™^-^ is 
the data derived smallest value for which all coefficients are zero). Each method was run on a 
Windows Vista machine with Intel Core 2DU0 2.26GHz with 4GB RAM in the case p = 10 
and on a MAC Pro machine with Intel Xeon 2x2.26GHz Quad Core with 6GB RAM in the 
case p = 50 (the MAC Pro machine was approximately 3.5 times as fast as the Windows 
machine) . 

3.2 Data generation 
3.2.1 The case p = 10 

In model ([1]), given that n individuals are observed, the distribution of the corresponding cell 
counts n = {n^,x € {—1,1}^) is multinomial with probability P = (P(x, B),x E { — 1,1}^). 
Accordingly, given a symmetric matrix 0, data were drawn from the multinomial distribution 
with probability vector P. Four matrices Q^^\ Q^'^\ Q^^^ and 0*^^-* were considered, leading 
to four different simulation designs. 

For e(^), " primary" coefficients 9k i were simulated independently using a normal distribution 
with mean zero and variance a for some u > 0. Subsequently, only coefficients 9^/ with an 
absolute value greater than 0.06 (corresponding to a conditional odds-ratio of exp(4 x 0.06) ~ 
1.27, since for {—1,1} variables, the conditional odds-ratio is exp(40fc^^)) were retained, all 
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others being set to 0. The function A{Q^^^) was then computed according to Equation ([2]). 
Selecting a = 0.05 led to a true model with 10 associations (among the p{p — l)/2 = 45 
potential associations) . 

Matrices G^^) and 6^^) were constructed so that they share the same sparsity pattern as 

e(i), 

i.e., 

{(kj) : eg = 0} = {{k,i) : eg = 0} = {{k,i) : eg = 0} 

but they have different non-zero coefficients. In either cases, we selected (^i^i, eio,io) = 
(-1.3, ...,0). For matrix B^^), the non-zero 6^/ coefficients were set to ±0.2, while they were 
set to ±0.4 for matrix Q^^\ 

For matrix Q^^\ we proceeded as for matrix 0*^^^ but we selected a = 0.3 and only the 6k/ 
coefficients with an absolute value greater than 0.2 were retained (the others being set to 
0). Moreover, we selected (^i^i, ^10,10) = (— 1.8, 0). This led to a true model with 19 
associations. 

A graphical representation of matrices G*-^-*, Q^'^\ Q^^^ and 0*^^^ as well as the corresponding 
marginal probabilities W{X^^^ = 1), for k = 1, 10, estimated on a sample of size n = 2500 
are presented on Figure [H 

3.2.2 The case p = 50 

For p = 50, we first considered the case of block-diagonal matrices G. For j = 1,2,3,4, we 
then used matrices of the form diag(0-', ,@^). 

In a fifth example, matrix Q^^^ was build as follows. For every k > £ > 1, we first draw one 
observation u from a (O,l)-uniform distribution, and eg was then set to if u < 0.9, log(2) if 
u > 0.95 and log(1.5) otherwise. The resulting true model consisted of 125 edges. Coefficients 
e^ ^, were set to (logit(0.1),...,logit(0.2)). Gibbs sampling was further used to generate the 
data (consisting of {0, 1} variables this time). 

3.3 Empirical comparison of the approximate deviances 

In this section, our goal is to empirically evaluate the approximate likelihoods on which the 
methods under study rely. To do so, we will focus on the case where p = 10 since the exact 
log-likelihood of the Poisson log-linear model can be computed for such a value of p. For each 
of the four matrices described above, we proceed as follows. We generate a random sample 
of size n = 500, and for each value of the tuning parameter A on an appropriate grid, we apply 
method BMNPseudo. This leads to some sparsity structure in the corresponding Ising model 
and we can then compute the Gaussian approximate log-likelihoods C^" , v = 1, 2, 3 (see 
below) as well as both the exact Poisson log-linear log-likelihood and the pseudo-likelihood 
for the Ising model corresponding to this particular sparsity structure. More precisely, the 
following quantities were considered, 

n p 

4' = -EEl°g{l + «^PMf^'^'[^']0A[,A:])} (16) 

i=l k=l 

n 

4° = $:iog{p(x,,0p°)} (17) 

i=l 



9 




2 4 



8 10 




2 4 



8 10 




8 10 




2 4 



8 10 



10 



Figure 1: Marginal probabilities 'P{X^'^'> = 1), for k = 1, 10, estimated on a sample of size 
n = 2500 generated using the matrix G = = 1,2,3,4 are presented in the left panels. 

A graphical representation of matrix = 1,2,3,4 is given in the right panels. 




10 20 30 40 50 



Figure 2: Marginal probabilities P(X(^^ = 1)? ^or k = 1, 50, estimated on a sample of size 
n = 2500 generated using the matrix B = G^^-* are presented in the left panel. A graphical 
representation of matrix 0^^^ is given in the right panel. 

= log \C^\ - ti{C^S^) for u = l,2, 3. (18) 

In (jl6p . ©A stands for the un-shrunk matrix derived under the sparsity structure inferred 
from method BMNPseudo with the sparsity parameter value A, and xf'^ and are as in ([7]). 
In ()17p . is the matrix of coefficients obtained using a Poisson log-linear model under 
the constrained induced by this sparsity structure, and P(x, G) is as in ([T|). Lastly, in (jlSp . 
Si = (Cov(Z) + diag(l/3)), 52 = Cov(Z), S3 = Cor(Z) and is defined as 

= arg max {log |M| - tr(MS',,)}, 
where Ai^ = {M >~ : M^^ = for couples {k,£) such that (9A)fc^ = 0} (with @\ as in 

m)- 

Figure [3] shows the corresponding deviances. It can be seen that using the Gaussian ap- 
proximate log- likelihood based on the covariance matrix with the additional 1/3 term on the 
diagonal results in a deviance which is quite far from the exact one. Furthermore, the deviance 
obtained with the covariance matrix (without adding the 1/3 term on the diagonal) equals 
that obtained with the correlation matrix, and both are closer to the exact deviance. Finally, 
the deviance of the pseudo-(log)-likelihood is always greater than the exact deviance. Using 
half the pseudo-likelihood corrects this undesirable effect in most cases. 

These results should obviously be considered with caution. Even if we tried to use various 6 
matrices to generate the data (and the conclusions were consistently the same), a theoretical 
study would be needed to confirm these empirical findings. 

3.4 Performance evaluation; the case p = 10 

Let us first consider the performances achieved by the oracle models (Tables [Hand [2]). Overall, 
methods BMNPseudo and GaussCor achieve good performances in terms of accuracy and Fl- 
score. It is also noteworthy that the computational time is much higher for BMNPseudo, while 
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Figure 3: Approximates for the Ising deviance. Deviances were computed for each model 
selected by BMNPseudo for various values of the sparsity parameter, on samples of size 
n = 500 generated with matrices 0^^^ (upper left corner), 0^^^ (upper right corner), 6(3) (lower 
left corner), and 0^'^) (lower right corner). Deviances were computed using the exact log-linear 
log-likelihood (solid black line, solid circles), the pseudo- likelihood (dashed blue line, circles), 
half the pseudo- likelihood (solid blue line, solid circles), and the Gaussian approximate log 
likelihoods based on the covariance matrix with an additional 1/3 term on the diagonal (dotted 
green line, circles), the covariance matrix (dashed green line, squares) and the correlation 
matrix (solid green line, solid circles) (see (fT6]l - (fT8]l for the corresponding formula). 
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overall performances of GaussCor appear to be slightly higher (especially under the fourth 
simulation design). 

When focusing on the three GaussApprox methods, we observed the following ranking 

GaussCor > GaussCov > GaussCov 1/3. 

Consequently, and to save space, methods GaussCov and GaussCov 1/3 will not be considered 
in the evaluation of the models selected via the BIC procedure. It is still interesting to note 
that GaussCor>GaussCov although we observed that the approximate deviances under these 
two methods were equal and close to the exact ones, which turn out to be an insufficient 
condition for achieving good performances. 

Turning our attention to the evaluation of models selected with the BIC procedure (Tables [3] 
andH]), a first observation is that, as suggested by the results of Section [331 computing the 
BIC with half the pseudo-likelihood (rather than the pseudo-likelihood itself) results in better 
models in most cases. Moreover, from the comparisons of the results of Tables [1] and [2] and 
Tables [3] and HI as n grows, the BIC procedure appears to enable to select models achieving 
performances similar to those achieved by the "oracle" models. Moreover, the computation of 
the un-shrunk estimates with method BMNPseudo appears to be very slow (the oracle models 
were much faster to compute than the models selected by the BIC approach for this particular 
method, especially when the sample size is small and in the fourth simulation design). 

Overall, SepLogit OR, SepLogit AND and GaussCor are the best methods, closely followed 
by BMNPseudo 1/2. Lastly, among these candidate methods, GaussCor is the fastest. 

We should lastly mention that method SepLogit was further tested using standardized co- 
variates in each £i-penalized logistic regression models (results not shown). To motivate this 
choice, we may mention that this is the default option in pac kage glmnet as th is approach is 



often adopted in applications when using ^i-penalization (see lKoh et al.l (j2007l ) for instance); 
its suitability in our context of binary variables was yet questionable. Interestingly, this ap- 
proach yielded results very similar to those obtained via the " standard" one on data generated 
using matrices Q^^\ 0^^^ and Q^^^ and slightly better when using matrix O^^^ (which however 
corresponds to the situation where we observed the greatest variability in the performances 
of every method) . 



3.5 Performance evaluation; the case p = 50 

To save space, we only present here the performances achieved by models selected via the 
BIC procedure, on samples of size n = 500 and n = 2500 generated using either matrix 
diag(e(^),e(^),G(j),e(-'),e(j)), for j = 1,2,3 or matrix e(^). Moreover, the results obtained 
in the case p = 10 especially show that method BMNPseudo can be quite slow, and that it 
does not outperform method SepLogit. Lastly, among the methods relying on a Gaussian 
approximate of the Ising likelihood, method GaussCor was observed to be the best. Therefore, 
in order to save computational time, only methods GaussCor and SepLogit were considered 
in the case p = 50. 

Results are presented in Table [H They are consistent with what was observed in the case 
p = 10. More precisely, methods SepLogit and GaussCor achieve comparable performances. 
Regarding computational time, GaussCor is still significantly faster than SepLogit. 
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Table 1: Evaluation of the "oracle models"; the case p = 10. Means (computed over 50 runs) 
are given for the computational time needed to compute the models on a grid of 50 equally- 
spaced A values as well as the number of detected associations (POS), the false positive rate 
(FPR), the true positive rate (TPR, which equals the recall, REC), the precision (PRE), the 
accuracy (Acc.) and the Fl score corresponding to the "oracle" model. 



Method 


Time (s) 


POS 


FPR 


tprt 


PRE 


Acc. 


Fl score 




Data generated with © = 


e(i) 






n = 100 
















BMNPseudo 


6.47 


3.22 


0.034 


0.202 


0.724 


0.796 


0.291 


GaussCov 1/3 


0.48 


1.84 


0.010 


0.150 


0.935 


0.804 


0.214 


GaussCov 


0.48 


2.02 


0.012 


0.160 


0.933 


0.804 


0.220 


GaussCor 


0.48 


1.98 


0.011 


0.158 


0.937 


0.804 


0.219 


n = 500 
















BMNPseudo 


20.20 


6.96 


0.031 


0.588 


0.883 


0.884 


0.676 


GaussCov 1/3 


0.72 


7.02 


0.033 


0.588 


0.877 


0.883 


0.674 


GaussCov 


0.72 


7.00 


0.031 


0.590 


0.881 


0.884 


0.677 


GaussCor 


0.72 


7.02 


0.031 


0.592 


0.881 


0.885 


0.678 


n = 2500 
















BMNPseudo 


82.65 


9.48 


0.003 


0.938 


0.991 


0.984 


0.961 


GaussCov 1/3 


0.96 


9.36 


0.002 


0.928 


0.992 


0.982 


0.957 


GaussCov 


0.96 


9.40 


0.002 


0.932 


0.992 


0.983 


0.959 


GaussCor 


0.96 


9.40 


0.002 


0.932 


0.992 


0.983 


0.959 




Data generated with © = 


e(2) 






n = 100 
















BMNPseudo 


20.17 


3.62 


0.024 


0.278 


0.864 


0.821 


0.376 


GaussCov 1/3 


0.54 


3.72 


0.025 


0.284 


0.867 


0.821 


0.376 


GaussCov 


0.54 


3.92 


0.028 


0.294 


0.857 


0.821 


0.386 


GaussCor 


0.55 


3.60 


0.025 


0.274 


0.870 


0.820 


0.366 


n = 500 
















BMNPseudo 


23.16 


7.02 


0.014 


0.654 


0.950 


0.912 


0.755 


GaussCov 1/3 


0.79 


7.44 


0.02 


0.674 


0.927 


0.912 


0.761 


GaussCov 


0.79 


7.46 


0.019 


0.678 


0.93 


0.913 


0.765 


GaussCor 


0.86 


8.24 


0.017 


0.766 


0.938 


0.935 


0.832 


n = 2500 
















BMNPseudo 


83.76 


9.74 


0.004 


0.960 


0.987 


0.988 


0.972 


GaussCov 1/3 


0.96 


9.86 


0.008 


0.958 


0.975 


0.984 


0.964 


GaussCov 


0.97 


9.70 


0.005 


0.954 


0.985 


0.986 


0.967 


GaussCor 


0.99 


10.04 


0.002 


0.998 


0.995 


0.998 


0.996 



t TPR=REC. 
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Table 2: Evaluation of the "oracle models"; the case p = 10. Means (computed over 50 runs) 
arc given for the computational time needed to compute the models on a grid of 50 equally- 
spaced A values as well as the number of detected associations (POS), the false positive rate 
(FPR), the true positive rate (TPR, which equals the recall, REC), the precision (PRE), the 
accuracy (Acc.) and the Fl score corresponding to the "oracle" model. 



Method 


Time (s) 


POS 


FPR 


tprt 


PRE 


Acc. 


Fl score 




Data generated with © = 








n = 100 
















BMNPseudo 


35.08 


6.26 


0.018 


0.564 


0.929 


0.889 


0.678 


GaussCov 1/3 


0.72 


6.54 


0.021 


0.582 


0.921 


0.891 


0.687 


GaussCov 


0.72 


6.60 


0.022 


0.584 


0.917 


0.891 


0.689 


GaussCor 


0.78 


7.66 


0.028 


0.668 


0.893 


0.904 


0.746 


n = 500 
















BMNPseudo 


21.02 


8.76 


0.011 


0.838 


0.966 


0.956 


0.889 


GaussCov 1/3 


0.84 


8.92 


0.027 


0.796 


0.913 


0.933 


0.836 


GaussCov 


0.87 


8.88 


0.017 


0.828 


0.946 


0.948 


0.874 


GaussCor 


0.96 


9.68 


0.006 


0.948 


0.981 


0.984 


0.963 


n = 2500 
















BMNPseudo 


80.60 


10.02 


0.002 


0.994 


0.993 


0.997 


0.993 


GaussCov 1/3 


0.92 


10.84 


0.035 


0.962 


0.894 


0.964 


0.923 


GaussCov 


0.97 


10.10 


0.007 


0.984 


0.976 


0.991 


0.979 


GaussCor 


0.99 


10.00 


0.001 


0.998 


0.998 


0.999 


0.998 




Data generated with © = 








n = 100 
















BMNPseudo 


39.89 


5.46 


0.055 


0.212 


0.816 


0.635 


0.312 


GaussCov 1/3 


0.60 


6.84 


0.084 


0.245 


0.775 


0.633 


0.333 


GaussCov 


0.60 


5.16 


0.053 


0.199 


0.800 


0.631 


0.297 


GaussCor 


0.60 


6.86 


0.079 


0.253 


0.761 


0.639 


0.357 


n = 500 
















BMNPseudo 


183.22 


12.94 


0.084 


0.566 


0.859 


0.768 


0.668 


GaussCov 1/3 


0.68 


12.36 


0.08 


0.541 


0.853 


0.760 


0.651 


GaussCov 


0.69 


13.02 


0.089 


0.563 


0.840 


0.764 


0.665 


GaussCor 


0.75 


13.64 


0.055 


0.643 


0.905 


0.818 


0.745 


n = 2500 
















BMNPseudo 


255.87 


15.06 


0.058 


0.714 


0.906 


0.846 


0.796 


GaussCov 1/3 


0.76 


14.68 


0.072 


0.674 


0.880 


0.820 


0.759 


GaussCov 


0.78 


15.16 


0.072 


0.700 


0.884 


0.832 


0.778 


GaussCor 


0.85 


15.70 


0.037 


0.776 


0.944 


0.884 


0.848 



t TPR=REC. 
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Table 3: Evaluation of the models selected by the BIC procedure; the case p = 10. Means 
(computed over 50 runs) are given for the computational time needed to compute the models 
on a grid of 50 equally-spaced A values as well as for the number of detected associations 
(POS), the false positive rate (FPR), the true positive rate (TPR, which equals the recall, 
REC), the precision (PRE), the accuracy (Acc.) and the Fl score corresponding to the model 
selected by the BIC procedure. 



Method 


lime (s) 


POS 


FPR 


tprt 


PRE 


Acc. 


Fl score 




Data 


generated with G = 








n = 100 
















ScpLogit OR 


21.94 


2.86 


0.043 


0.136 


0.540 


0.775 


0.234 


ScpLogit AND 


21.94 


2.12 


0.029 


0.112 


0.585 


0.780 


0.216 


BMNPscudo 


14.20 


8.16 


0.141 


0.322 


0.424 


0.740 


0.355 


BMNPseudo 1/2 


14.20 


2.60 


0.039 


0.122 


0.492 


0.774 


0.233 


GaussCor 


1.12 


2.56 


0.035 


0.132 


0.572 


0.780 


0.235 


n — 500 
















SpnTjOErit OR 


22.91 


4.80 


0.018 


0.416 


0.885 


0.856 


0.549 


SepLogit AND 


22.91 


4.12 


0.014 


0.364 


0.892 


0.848 


0.504 


BMNPscudo 


42.43 


9.48 


0.079 


0.670 


0.725 


0.865 


0.687 


BMNPseudo 1/2 


42.43 


4.60 


0.015 


0.408 


0.895 


0.857 


0.550 


GaussCor 


1.16 


4.48 


0.014 


0.400 


0.895 


0.856 


0.539 


n — 2500 
















SepLogit OR 


27.89 


9.52 


0.006 


0.93 


0.979 


0.980 


0.952 


SepLogit AND 


27.89 


9.32 


0.004 


0.918 


0.986 


0.979 


0.949 


BMNPseudo 


176.09 


11.66 


0.055 


0.972 


0.846 


0.951 


0.901 


BMNPseudo 1/2 


176.09 


9.36 


0.005 


0.920 


0.985 


0.979 


0.948 


GaussCor 


1.19 


9.30 


0.005 


0.912 


0.983 


0.976 


0.943 




Data 


generated with = O'^^ 






^ 1 nn 
n = iUU 
















SepLogit OR 


23.45 


4.68 


0.058 


0.266 


0.618 


0.792 


0.361 


SepLogit AND 


23.45 


2.76 


0.027 


0.182 


0.687 


0.797 


0.302 


BMNPseudo 


215.65 


9.82 


0.154 


0.444 


0.482 


0.757 


0.443 


BMNPseudo 1/2 


215.65 


3.22 


0.033 


0.208 


0.688 


0.799 


0.325 


GaussCor 


1.11 


3.80 


0.041 


0.236 


0.686 


0.798 


0.341 


n = 500 
















ScpLogit OR 


33.18 


7.20 


0.018 


0.658 


0.922 


0.910 


0.758 


ScpLogit AND 


33.18 


6.18 


0.007 


0.592 


0.960 


0.904 


0.720 


BMNPseudo 


97.70 


10.82 


0.087 


0.776 


0.735 


0.882 


0.747 


BMNPseudo 1/2 


97.70 


6.56 


0.014 


0.606 


0.934 


0.901 


0.720 


GaussCor 


1.16 


7.02 


0.014 


0.652 


0.931 


0.912 


0.758 


n = 2500 
















SepLogit OR 


34.52 


10.28 


0.011 


0.990 


0.966 


0.989 


0.977 


ScpLogit AND 


34.52 


9.92 


0.005 


0.976 


0.985 


0.991 


0.980 


BMNPseudo 


175.80 


11.70 


0.049 


0.998 


0.872 


0.961 


0.926 


BMNPseudo 1/2 


175.80 


10.16 


0.013 


0.972 


0.960 


0.984 


0.965 


GaussCor 


1.30 


10.06 


0.005 


0.988 


0.984 


0.993 


0.985 



T TPR=REC. 
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Table 4: Evaluation of the models selected by the BIC procedure; the case p = 10. Means 
(computed over 50 runs) are given for the computational time needed to compute the models 
on a grid of 50 equally-spaced A values as well as the number of detected associations (POS), 
the false positive rate (FPR), the true positive rate (TPR, which equals the recall, REC), the 
precision (PRE), the accuracy (Acc.) and the Fl score corresponding to the model selected 
by the BIC procedure. 



Method 


lime (s) 


POS 


FPR 


tprt 


pre 


Acc. 


Fl score 




Data 


generated with O = 








n = 100 
















ScpLogit OR 


30.86 


7.86 


0.055 


0.594 


0.775 


0.867 


0.660 


ScpLogit AND 


30.86 


5.44 


0.021 


0.470 


0.868 


0.866 


0.599 


BMNPscudo 


276.48 


12.18 


0.147 


0.702 


0.613 


0.819 


0.640 


BMNPseudo 1/2 


276.48 


6.44 


0.031 


0.534 


0.856 


0.872 


0.638 


GaussCor 


1.26 


7.42 


0.037 


0.612 


0.834 


0.885 


0.697 


n = 500 
















SepLogit OR 


33.10 


10.14 


0.027 


0.918 


0.912 


0.960 


0.912 


SepLogit AND 


33.10 


8.90 


0.009 


0.858 


0.967 


0.961 


0.906 


BMNPscudo 


163.20 


12.78 


0.098 


0.936 


0.760 


0.910 


0.830 


BMNPseudo 1/2 


163.20 


9.80 


0.030 


0.876 


0.906 


0.949 


0.885 


GaussCor 


1.26 


9.78 


0.014 


0.930 


0.957 


0.974 


0.940 


n = 2500 
















SepLogit OR 


40.76 


10.30 


0.009 


1.000 


0.973 


0.993 


0.986 


SepLogit AND 


40.76 


10.02 


0.001 


1.000 


0.998 


0.999 


0.999 


BMNPseudo 


176.22 


11.14 


0.033 


1.000 


0.914 


0.975 


0.951 


BMNPseudo 1/2 


176.22 


10.18 


0.005 


1.000 


0.984 


0.996 


0.992 


GaussCor 


1.26 


10.12 


0.003 


1.000 


0.989 


0.997 


0.994 



Data generated with Q — Q 

n = 100 



SepLogit OR 


30.63 


20.96 


0.443 


0.497 


0.452 


0.532 


0.470 


SepLogit AND 


30.63 


14.88 


0.322 


0.343 


0.440 


0.537 


0.383 


BMNPseudo 


193.43 


27.16 


0.577 


0.640 


0.446 


0.515 


0.523 


BMNPseudo 1/2 


193.43 


18.70 


0.405 


0.431 


0.440 


0.526 


0.431 


GaussCor 


1.87 


20.72 


0.452 


0.473 


0.432 


0.516 


0.445 


n = 500 
















ScpLogit OR 


36.10 


10.48 


0.038 


0.500 


0.916 


0.767 


0.642 


ScpLogit AND 


36.10 


8.30 


0.012 


0.421 


0.967 


0.749 


0.583 


BMNPseudo 


1090.87 


13.52 


0.107 


0.565 


0.815 


0.755 


0.657 


BMNPseudo 1/2 


1090.87 


9.84 


0.044 


0.458 


0.897 


0.746 


0.600 


GaussCor 


1.80 


10.70 


0.025 


0.529 


0.945 


0.787 


0.675 


n = 2500 
















SepLogit OR 


52.32 


15.26 


0.048 


0.737 


0.922 


0.861 


0.817 


ScpLogit AND 


52.32 


13.44 


0.016 


0.685 


0.971 


0.858 


0.802 


BMNPseudo 


2299.82 


17.40 


0.122 


0.748 


0.829 


0.823 


0.783 


BMNPseudo 1/2 


2299.82 


14.72 


0.056 


0.698 


0.905 


0.840 


0.786 


GaussCor 


1.86 


14.76 


0.026 


0.741 


0.957 


0.876 


0.834 



T TPR=REC. 
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Table 5: Evaluation of the models selected by the BIC procedure; the case p = 50. Means 
(computed over 50 runs) are given for the computational time needed to compute the models 

on a grid of 50 equally-spaced A values as well as the number of detected associations (POS), 
the false positive rate (FPR), the true positive rate (TPR, which equals the recall, REC), the 
precision (PRE), the accuracy (Acc.) and the Fl score corresponding to the model selected 
by the BIC procedure. 



Method 


Time (s) 


POS 


FPR 


tprt 


PRE 


Acc. 


Fl score 


Data 


generated 


with e 


= diag(0 










n = 500 
















SepLogit OR 


236.21 


39.00 


0.016 


0.399 


0.515 


0.960 


0.447 


SepLogit AND 


236.21 


30.10 


0.011 


0.352 


0.588 


0.963 


0.438 




o±.oo 




0.013 


0.376 


0.562 


0.962 




n = 2500 
















oepLogit UK 


^ AO T/t 

240.74 


t: A AO 

54.48 


0.006 


0.938 


0.864 


0.991 


0.898 


bepLogit AND 


^ AO 'y A 


50.64 


0.004 


0.921 


0.910 


0.993 


0.915 


GaussCor 


zo.oO 


DZ.04 


0.005 


0.918 


0.881 


0.991 


n QQQ 


Data 


generated 


with e 


= diag(e(2\e(2) 






n = 500 
















SepLogit OR 


285.91 


53.36 


0.017 


0.661 


0.626 


0.970 


0.640 


SepLogit AND 


285.91 


34.14 


0.007 


0.527 


0.778 


0.974 


0.626 


i-r Jin Q Q 1; r» r 


32.75 


45.42 


0.011 


0.642 


0.717 


0.975 


0.673 


n = 2500 
















oepLogit OR 


304.09 


59.26 


0.008 


0.988 


0.837 


0.991 


0.905 


bepLogit AND 


304.09 


50.80 


0.003 


0.950 


0.937 


0.995 


0.943 


GaussCor 


29.00 


54.40 


0.004 


0.995 


0.917 


0.996 


o nc /I 
U.9o4 


Data 


generated 


with e 


= diag(0 






(3)^e(--')) 




n = 500 
















SepLogit OR 


305.44 


68.66 


0.019 


0.936 


0.685 


0.980 


0.790 


SepLogit AND 


305.44 


47.04 


0.005 


0.834 


0.888 


0.989 


0.859 


GaussCor 


30.34 


60.34 


0.010 


0.968 


0.811 


0.989 


0.880 


n = 2500 
















SepLogit OR 


380.86 


57.72 


0.007 


1.000 


0.868 


0.994 


0.929 


SepLogit AND 


380.86 


51.50 


0.001 


1.000 


0.971 


0.999 


0.985 


GaussCor 


25.48 


50.48 


0.000 


1.000 


0.991 


1.000 


0.995 




Data generated with 6 = 6*^* 






n = 500 
















SepLogit OR 


271.84 


95.20 


0.017 


0.610 


0.802 


0.945 


0.692 


SepLogit AND 


271.84 


71.00 


0.007 


0.508 


0.894 


0.944 


0.647 


GaussCor 


74.71 


86.96 


0.013 


0.582 


0.839 


0.946 


0.685 


n = 2500 
















SepLogit OR 


307.38 


129.00 


0.008 


0.960 


0.931 


0.989 


0.945 


SepLogit AND 


307.38 


117.48 


0.002 


0.922 


0.981 


0.990 


0.950 


GaussCor 


65.41 


123.68 


0.007 


0.927 


0.938 


0.986 


0.932 



T TPR=REC. 
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3.6 Agreement between the compared methods 

One way to measure the agreement between two selected models is to compare them with 
their intersection. Let Gi = (y,Ei) and G2 = iV,E2) be the two graphs to be compared. 
Further let E* = Ei n E2 and denote by '^E the cardinality of a set E of edges. To compare 
graphs Gi and G2, we will consider the quantities 

^{01,02) = . (19) 
mm{^Ei4E2) 

-<Qi,G2) = (tti^i - tti^'^) + (tJi?2 - P*), (20) 

as measures of agreement and disagreement respectively {R is the cardinality of the symmetric 
difference between Ei and E2). Observe that according to these measures, the saturated graph 
would both agree and disagree "a lot" with any sub-graph. 

Results are presented in Table [6] for p = 10 and = {0*^^^ O^*^^} which correspond to the two 
extreme situations in terms of signal-to-noise ratio. Overall, BMNPseudo is closer to SepLogit 
than GaussCor, what was expected given the respective principles of the methods. Moreover, 
agreement [resp. disagreement] between the various models is higher [resp. lower] when the 
signal-to-noise ratio is high, that is when n = 2500 and/or Q = Q^^\ When the signal-to-noise 
ratio is low and models are quite different, a natural question is how to get the best model. 
Intersecting two models is one of the candidate approaches. For the sake of brevity, we do not 
present the complete results here, but intersecting GaussCor and SepLogit OR for instance 
resulted in quite conservative models that generally achieved better performances than either 
GaussCor or SepLogit OR (in terms of accuracy and Fl-score). 

Models derived under method SepLogit with standardized covariates were also compared to 
the other models (results not shown). Overall, we observed very good agreement between 
the standard approach and the standardized approach (especially in the case of high signal- 
to- noise ratio). We also observed slightly better agreements between these models and those 
derived under method GaussCor, especially on datasets generated using matrix 9(4). 

3.7 Comparison of the conditional odds-ratios estimates 

Both methods SepLogit and BMNPseudo have been empirically shown to yield appropriate 
estimates for the conditional odds-ratios. On the other hand, it is rather unclear whether 
estimates derived from methods relying on the Gaussian approximation would be consistent. 
We therefore conclude this simulation study with a simple study to check it. 

To avoid interpretation difficulties related to the model selection accuracy, coefficient estimates 
were computed under the sparsity structure of the true models and compared with the true 
coefficients (this corresponds to the situation where the true sparsity structure is known). To 
do so, we used an approach similar to the one used to get un-shrunk estimates for the BIG 
procedure. 

The mean squared errors for the conditional log-odds ratios, which we defined here as 

MSE = 1000 X ^fe>^(^M-M' (21) 



19 



Table 6: Results of the simulation study: agreement evaluation. Means (along with standard 
deviations) of the criteria defined in (jl9p -()20p were obtained from 50 runs for various sample 
sizes n and matrices Q. 



Comparisons 


Agreement k 


Disagreement k 


Data generated with O — 


% n 


= 100 






SepLogit OR SepLogit AND 


1.000 


(0 000) 


2.420 


(1.605) 


SepLogit OR BMNPseudo 


0.970 


(0.071) 


2.140 


(1.807) 


SepLogit AND BMNPseudo 


0.965 


(0 099") 


1.720 


(1.578) 


BMNPseudo GaussCor 


0.941 


(0.087) 


2.940 


(1.463) 


SepLogit OR GaussCor 


0.941 


(0.077) 


2.480 


(1.632) 


SepLogit AND GaussCor 


0.940 


(0 08,6) 


3.060 


(1.476) 


Data generated with O — 


e'''), n= 


=2500 






SepLogit OR SepLogit AND 


1.000 


(0.000) 


0.280 


(0.497) 


SepLogit OR BMNPseudo 


1.000 


(0.000) 


0.160 


(0.370) 


SepLogit AND BMNPseudo 


0.998 


(0.013) 


0.200 


(0.495) 


BMNPseudo GaussCor 


1.000 


(0.000) 


0.100 


(0.303) 


SepLogit OR GaussCor 


1.000 


(0.000) 


0.220 


(0.418) 


SepLogit AND GaussCor 


0.998 


(0.013) 


0.140 


(0.405) 


Data generated with O — 


6(4), n 


= 100 






SepLogit OR SepLogit AND 


1.000 


(0.000) 


6.080 


(2.146) 


SepLogit OR BMNPseudo 


0.952 


(0.048) 


4.800 


(2.711) 


SepLogit AND BMNPseudo 


0.957 


(0.053) 


5.400 


(2.148) 


BMNPseudo GaussCor 


0.967 


(0.054) 


6.740 


(2.863) 


SepLogit OR GaussCor 


0.896 


(0.081) 


8.380 


(2.900) 


SepLogit AND GaussCor 


0.931 


(0.081) 


9.900 


(3.112) 


Data generated with O — 


ew, n= 


=2500 






SepLogit OR SepLogit AND 


1.000 


(0.000) 


1.820 


(1.424) 


SepLogit OR BMNPseudo 


0.981 


(0.031) 


1.880 


(1.335) 


SepLogit AND BMNPseudo 


0.983 


(0.030) 


1.900 


(1.359) 


BMNPseudo GaussCor 


0.950 


(0.050) 


2.900 


(1.474) 


SepLogit OR GaussCor 


0.972 


(0.035) 


2.220 


(1.183) 


SepLogit AND GaussCor 


0.980 


(0.035) 


2.200 


(1.666) 
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Table 7: Results of the simulation study: evaluation of the conditional log-odds-ratios estima- 
tion. Means of the mean squared error (MSE) defined in (12ip were obtained from 50 runs for 
various sample sizes n and matrices @. Three methods were evaluated: the genuine method 
of Banerjee et al. (2008) GaussCov 1/3, its modification relying on the correlation matrix 
GaussCor and the method relying on multiple logistic regressions SepLogit. 



Sample size GaussCov 1/3 


GaussCor 


SepLogit 


Data generated with 


e = 






n = 500 2.613 




2.073 


1.957 


n = 2500 2.140 




0.451 


0.435 


Data generated with 


e = 


9(2) 




n = 500 14.568 




7.743 


6.175 


n = 2500 14.342 




6.573 


1.201 


Data generated with 


e = 


e(3) 




n = 500 58.502 




29.197 


10.704 


n = 2500 57.451 




26.779 


1.739 


Data generated with 


e = 






n = 500 98.304 




68.034 


50.007 


n = 2500 97.402 




65.973 


12.794 



are reported in Table[7]for methods SepLogit, GaussCov 1/3 and GaussCor in the case p = 10 
and for samples of sizes n = 500 and n = 2500 (for SepLogit each coefficient Oi^- i was set as 
the mean of the two coefficients returned by the two constrained logistic regressions involved 
in this method). It can be observed that overall neither GaussCov 1/3 nor GaussCor led 
to consistent estimates for the 6k/ coefficients. These methods (especially GaussCor) should 
therefore be combined with other methods (for instance SepLogit) when estimates of the 
conditional odds-ratios are needed. 

Inconsistency of the estimates derived under method GaussCor can be explained as follows. 
Since this method relies on the correlation matrix, it is clo sely related to the method consistin; 



in performing linear regressions at each node (as shown by lMeinshausen and Biihlmannl (120061 ) 
in the Gaussian case). Therefore, the coefficients returned by this method are related to the 
coefficients jk/ involved in the linear model 

E(x(^')|x(-fc)) = P(x('^) = l|x(-'=))= j, + J2lk,ex'^'^- (22) 

Clearly, coefficients 'jk/ are quite different from the conditional log odds-ratios 9^/ involved 
in the Ising model (see ([I])). 

4 Application to the search for associations among causes of 
death 

The general objective of the application in this example is to detect associations between 
causes of death and identify possibly relevant groupings of causes contributing to the death. 
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4.1 Description of the data 



The dataset we used consists in causes of death recorded in all death certificates (n = 535 684) 
in France for the year 2005. In Prance, death certification (compulsory with 100% coverage) 
conforms to the WHO guideline: the death process is described starting from the underlying 
causes of death and ending with the immediate cause of death; other contributing causes 
of death are also recorded. All these causes were considered in the analysis. Causes are 
further coded according to the International Cl assification of Diseases (ICD)(in 2005, the 
tenth revision (jWorld Heath Organization! . Il994l )). The total number of code categories is 
large (about 4000 codes used) but for this an a.lysis we appl i ed a simplified classification of 
59 entities according to the Eurostat shorthst ( EURO STAT . 20021 ) (see Appendix A for the 
classification used). Therefore, p = 59 and each death certificate can be regarded as a vector 
X = {x^^\ x^^^^) in {0, 1}^^, where, for all k = 1, 59, x^'^) = 1 if and only if the k-th class 
is recorded on the death certificate. About 11% of the certificates had more than five causes 
of death, 14% had four causes, 26% had three, 30% had two and 18% had a sole cause of 
death. The frequencies of each cause are reported in Appendix A; the most frequent causes 
of death were, in decreasing order, heart failure, ischaemic heart diseases, cerebrovascular 
diseases, hypertensive diseases, pneumonia, diabetes, lung cancer and senility. 



4.2 Graphical model analysis 

Most common causes of death clearly depend upon age and gender, and a natural question is 
whether associations among causes of death also vary with age and gender. We then decided 
to split the whole population into strata defined by gender and age The complete analysis of 
every stratum is out of the scope of the present paper, and we only present here the results 
from the analysis of two sub-groups, namely those of males aged between 15 and 24 (2918 
observations) and males aged between 45 and 64 (57045 observations). 

First considering the group of age 15-24, we applied GaussCor, BMNPseudo, SepLogit AND, 
and SepLogit OR after selecting the sparsity parameter according to the BIC procedure de- 
scribed above. This yielded models with 113, 107, 61 and 129 associations respectively. Good 
agreement was observed between models derived under methods SepLogit and BMNPseudo 
(we had k = 0.935 for the comparison between BMN and SepLogit OR for instance). However, 
the model derived under method GaussCor was slightly different from the three other models 
(we had k = 0.700 for the comparison between GaussCor and SepLogit OR and k = 0.918 for 
the comparison between GaussCor and SepLogit AND). More precisely, the model obtained 
with method GaussCor entailed many more positive associations, a few of which correspond- 
ing to variables co-occurring only once in the sub-group. This suggests that method GaussCor 
might be a little too sensitive to positive associations, especially when the signal-to-noise ratio 
is low (in this particular study, the signal-to-noise ratio is low due to highly unbalanced vari- 
ables). Regarding computational time, 1.6 second was needed to compute method GaussCor 
while it took 19628 and 876 seconds for computing methods BMNPseudo and SepLogit re- 
spectively (analyses were performed on the Windows machine). For these latter two methods, 
we were not able to conduct the analysis with the choice A™"^ = A^^-^/lOOO, and we had to 
select A'"^'^ = A™^''/50 and A"'^'^ = A^^^'/lOO for methods BMNPseudo and SepLogit respec- 
tively. It is also noteworthy that the computational time needed for methods BMNPseudo and 
SepLogit is mostly due to the computation of the un-shrunk estimates (necessary for the 
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derivation of the BIC); omitting this step, the computational time using method BMNPseudo 
[resp. SepLogit] is reduced to 4598 seconds [resp. 198 seconds]. 

Figure m shows the final retained model for the group 15-24, which is the intersection of the 
models derived under methods SepLogit OR and GaussCor. Apart from the obvious asso- 
ciation between depression and suicide, the strongest (positive) associations identified were 
between diabetes and other endocrinal diseases, colorectal cancer and metastasis, septicemia 
and pneumonia, and between diseases of arteries, arterioles and capillaries and cerebrovascular 
diseases. The strong negative associations between transport accidents and all other condi- 
tions, and between suicide and most other conditions (except depression and other mental 
disorders) is also worth noting. These latter associations correspond to well-known sequences 
of causes leading to death and most of those present in the figure have strong biological 
plausibility. 

In the analysis of the older group, we only applied methods SepLogit and GaussCor (to save 
computational time), which took 15241 and 4.8 seconds respectively. In this case, we had to 
use A™^'^ = A™^/50 for method SepLogit. Moreover, when omitting the computation of the 
un-shrunk estimates, the computational time using method SepLogit reduced to 1943 seconds. 
600, 778 and 708 associations were detected by method SepLogit AND, SepLogit OR and 
GaussCor respectively. Good agreement was observed between the models (we had k = 0.933 
for the comparison between GaussCor and SepLogit OR and k = 0.953 for the comparison 
between GaussCor and SepLogit AND), which tends to confirm that agreement between the 
models returned by methods SepLogit and GaussCor increases with the signal-to-noise ratio. 



5 Discussion 

In this paper we empirically compared several approximate methods designed to search for 
associations among binary variables. We observed that methods SepLogit and BMNPseudo 
achieved similar performances in terms of accuracy and Fl-score, with a slight advantage to 
method SepLogit. Moreover, the models selected by both methods are very similar in most 
cases, as could be expected given the similarity between them. In terms of computational 
time, SepLogit appeared to be overall faster than BMNPseudo, but the two methods share 
the disadvantage of being quite slow to compute, especially for low values of the sparsity 
parameter. 

For the method BMNPseudo, we observed that using half the pseudo-likelihood rather than 
the pseudo-likelihood itself when computing the BIC enables us to select better models in 
most cases. The multiplicative coefficient 1/2 might not be optimal in all situation and some 
adaptive coefficient might be derived from a theoretical study of the pseudo-likelihood. Al- 
ternatively, cross-validation could be considered at a cost of an increased computational time, 
which seems undesirable given the aforementioned lack of speed of this method. Moreover , 
the suitability of cross-validation for model selection remains questionable ( Gao et al. . 20091 ). 



In terms of accuracy, the method proposed by Baneriee et al. ( 20081 ) was shown to be g( 



gen- 



erally too conservative. We then proposed a slight modification, referred to as GaussCor, in 
which we remove the additive 1/3 term on the diagonal, and use the sample correlation ma- 
trix as a starting point for the algorithm. With these modifications, GaussCor combines good 
overall performances (comparable to the performances achieved by SepLogit and BMNPseudo) 
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hypeitens perlnat_cond skirr_dis 
tuberc cJnng CLpancreas 

c_steniac c_eral isc_heBrt_dis 
malnut dementia embelism 




Figure 4: Graphical model obtained with Cytoscape on the real data set (males aged between 
15 and 24 years). Positive associations (solid lines) and negative associations (dashed lines) 
are presented. The line widths of edges are proportional to the conditional log-odds-ratios 
(estimated using multiple logistic regressions built under the constraint implied by the retained 
model). For instance, the (absolute value of the) conditional log-odds-ratio was about 4.5 for 
the association depression/suicide, 3 for the association liver disease/other diseases of the 
digestive system, 1.7 for the association other infectious disease/other endocrinal disease, and 
0.35 for the association other mental disordcrd/suicide. Similarly, vertices are represented 
by balls with diameter related to the observed frequency of the causes of death. Transport 
accidents were reported on about 40% of the death certificates while falls were only reported 
on 1.2% of the death certificates. Conditions listed on the left side axe not associated with 
any other condition or disease. 
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and exceptional computational speed. In particular, GaussCor was observed to be between 
3 and 200 times faster than the other methods on simulated data. This speed is partic- 
ularly desirable for handling truly high-dimensional datasets since the concurrent methods 
(SepLogit or BMNPseudo) might be dramatically slow in such cases. To be complete, we 
should mention that method SepLogit could be implemented using other sparse log i stic re - 



gression algor i thms t hat might be fas ter than the glmnet R package (see Koh et al. ( 200?! ) 



Genkin et al. ( 20071 ): Lee et al. ( 20061 ) for instance). However, we think that the compari- 
son conducted her e was fair since both R packages glmnet and glasso rely on a coordinate 



descent algorithm ( Friedman et al. . 2008bl ). 



Interestingly, we also observed that the models selected by methods GaussCor and SepLogit 
can be significantly different, especially in the situation of low signal-to-noise ratio. On our 
real example, we decided to retain the intersection of the two selected models as the final 
model, which led to conservative but competitive models on simulated examples. However, 
other approaches can be considered and it would be interesting to further study how these 
methods can be optimally combined. 

Approximate methods that use either multiple logistic regressions or the pseudo-likelihood 
have been shown to attai n performances similar to tho se reached using exact inference at a 
lower computational cost ( Hofling and Tibshirani . 20091 ). Our results suggest here that using 
Gaussian approximates of the Ising likelihood can ensure similar statistical performance at a 
greatly improved speed. In the absence of a theoretical justification for the good performances 
achieved by this method however, we can only claim here that GaussCor is a candidate method 
that can be recommended in some cases; a theoretical study might enable to have a better 
idea of what these cases are. 
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Appendix: The retained classification of causes of death 



Disease / Condition 



Label 



ICD-10 codes 



Count 



Septicaemia septicaemia 

tuberculosis tuberc 

aids and HIV infection aids 

Other infectious disease otherjnfect 

Oral cancer c_oral 

Oesophageal cancer c_oesoph 

Stomach cancer c_cstomac 

Colorectal cancer c_colorect 

Liver cancer cJiver 

Pancreas cancer c_pancreas 

Larynx cancer cJarynx 

Lung cancer cjung 

Breast cancer c_breast 

Uterus cancer c_uterus 

Prostate cancer c_prostate 

Bladder cancer c_bladder 

Hodgkin's disease and leukemia cjiodgkin 

Secondary malignant neoplasm c_metasta 

Other cancers c_other 



Diseases of the blood blood_dis 

Diabetes diabetes 

Malnutrition malnut 

Other endocrinal disease other_endo 

Dementia dementia 
Mental disorders due to use of alco- alcohoLment 
hoi 

Mood disorders depression 

Other mental disorders otherjnent 

Parkinson's disease parkinson 

Alzheimer's disease Alzheimer 

Other diseases of the nervous sys- other_nervous 
tem,the eye and adnexa 

Hypertensive diseases hypcrtcns 

Ischaemic heart diseases isc_heart_dis 

Pulmonary embolism,phlebitis and embolism 
thrombophlebitis 

Cardiac arrhythmias card_arrhytm 

Heart failure heart_fail 

Cerebrovascular discEises cere_v£isc 

Diseases of arteries, arterioles and arteries 

capillaries 

Other diseases of the circulatory other_circ 
system 

Influenza (other than avian in- influenza 
fluenza) 



B90 



A40-A41 
A15-A19, 

B20-B24 

A00-A14, A20-A39, A42-B19, B25- 

B89, B91-B99 

C00-C14 

C15 

C16 

C18-C21 
C22 
C25 
C32 

C33-C34 
C50 

C53-C55 

C61 

C67 

C81-C96 
C77-C79 

C17, C23-C24, C26-C31, C35-C49, 
C51-C52, C56-C60, C62-C66, C68- 
C76, C80, C97-D49 
D50-D89 
E10-E14 
E40-E46 
E00-E09, 
F01-F03 
FIO 



E15-E39, E47-E90 



F30-F39 

FOO, F04-F09, F11-F29, F40-F99 

G20 
G30 

G00-G19, G21-G29, G31-H95 

110-115 
120-125 
126, 180-182 

147-149 

150 

160-169 
170-179 



100-109, 
183-199 
JlO-Jll 



116-119, 127-146, 151-159, 



25713 
1720 

1094 
14188 

5076 
4654 
5642 
19587 
8528 
8615 
2062 
30642 
14439 
3708 
13361 
5946 
15574 
50314 
72943 



12955 
32704 
12713 
27490 
22966 
12634 

8628 

15629 
8598 
22568 
27028 

44117 
62071 
16697 

38020 

73268 
58161 
25567 

80060 

1192 
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Disease / Condition 


Label 


ICD-10 codes 


Count 


Pneumonia 


pneumo 


J12-J18 


38677 


Asthma and status asthmaticus 


asthma 


J45-J46 


2886 


Other chronic lower respiratory dis- 


other Jow_resp 


J40-J44, J47 


17739 


eases 

Lung diseases due to external 


lung_extern 


J60-J70 


10137 


agents 








Other diseases of the respiratory 


other _resp 


J00-J09, J19-J39, J48-J59, J71-J99 


59123 


system 








Peptic ulcer 


pept_ulcer 


K25-K28 


1965 


Diseases of liver 


liver_dis 


K70-K77 


20661 


Other diseases of the digestive sys- 


other_digest 


K00-K24, K29-K69, K78-K99 


34507 


tem 








Diseases of the skin and subcuta- 


skin_dis 


L00-L99 


11056 


neous tissue 








Diseases of the musculoskeletal sys- 


musculoskeletal 


M00-M99 


9644 


tem and connective tissue 








Diseases of the genitourinary sys- 


genito_urinary 


N00-N99 


37683 


tem 








Pregnancy, childbirth and the puer- 


pregnancy 


000-099 


67 


perium 








Certain conditions originating in 


perinat_cond 


P00-P96 


2100 


the perinatal period 








Congenital malformations, defor- 


congenit_malf 


Q00-Q99 


2411 


mations and chromosomal abnor- 








malities 








Senility 


senility 


R54 


23646 


Other symptoms and abnormal 


other_undef 


R00-R53, R55-R59 


c\r^ 

252979 


clinical findings, not elsewhere clas- 








sified 








Transport accidents 


transp_acc 


V01-V99 


5686 


Falls 


falls 


W00-W19 


6217 


Intentional self-harm 


suicide 


X60-X84 


10900 


Other external causes of morbidity 


other .extern 


W20-X59, X85-Y89 


21813 


and mortality 
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